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Abstract 

In this paper we study the problem of reconstruction of a low- 
rank matrix observed with additive Gaussian noise. First we show 
that under mild assumptions (about the prior distribution of the sig- 
nal matrix) we can restrict our attention to reconstruction methods 
that are based on the singular value decomposition of the observed 
matrix and act only on its singular values (preserving the singular 
, vectors). Then we determine the effect of noise on the SVD of low- 

>► rank matrices by building a connection between matrix reconstruction 

problem and spiked population model in random matrix theory. Based 
I on this knowledge, we propose a new reconstruction method, called 

■^J- RMT, that is designed to reverse the effect of the noise on the singu- 

la — lar values of the signal matrix and adjust for its effect on the singular 

CD vectors. With an extensive simulation study we show that the pro- 

posed method outperform even oracle versions of both soft and hard 
thresholding methods and closely matches the performance of a gen- 
eral oracle scheme. 
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1 Introduction 

Existing and emerging technologies provide scientists with access to a grow- 
ing wealth of data. Some data is initially produced in the matrix form, while 
other can be represented in a matrix form once the data from multiple sam- 
ples is combined. The data is often measured with noise due to limitations 
of the data generating technologies. To reduce the noise we need some in- 
formation about the possible structure of signal component. In this paper 



we assume the signal to be a low rank matrix. Assumption of this sort ap- 
pears in multiple fields of study including genomics (Raychaudhuri et al., 
2000; Alter et al., 2000; Holter et al., 2000; Wall et al., 2001; Troyanskaya 
et al., 2001), compressed sensing (Candes et al., 2006; Candes and Recht, 
Candes and Recht; Donoho, 2006), and image denoising (Wongsawat, Rao, 
and Oraintara; Konstantinides et al., 1997). In many cases the signal matrix 
is known to have low rank. For example, a matrix of squared distances be- 
tween points in d- dimensional Euclidean space is know to have rank at most 
d + 2. A correlation matrix for a set of points in (^-dimensional Euclidean 
space has rank at most d. In other cases the target matrix is often assumed 
to have low rank, or to have a good low-rank approximation. 

In this paper we address the problem of recovering a low rank signal 
matrix whose entries are observed in the presence of additive Gaussian noise. 
The reconstruction problem considered here has a signal plus noise structure. 
Our goal is to recover an unknown mxn matrix A of low rank that is observed 
in the presence of i.i.d. Gaussian noise as matrix Y: 

Y = A+^W, where W« ~ i.i.d. N(0, 1). 

'n 



The factor n -1 / 2 ensures that the signal and noise are comparable, and is 
employed for the asymptotic study of matrix reconstruction in Section 3. In 
what follows, we first consider the variance of the noise a 2 to be known, and 
assume that it is equal to one. (In Section 4.1 we propose an estimator for 
cr, which we use in the proposed reconstruction method.) In this case the 
model (1) simplifies to 

Y = A+^W, where W« ~ i.i.d. M0, 1). (1) 

*n 



Formally, a matrix recovery scheme is a map g : ]R mx?1 — >• ^ mxn f r0 m 
the space ofmxn matrices to itself. Given a recovery scheme g(-) and an 
observed matrix Y from the model (1), we regard A = g(Y) as an estimate 
of A, and measure the performance of the estimate A by 

Loss(A,A) = \\A-A\\ 2 FJ (2) 

where \\ ■ \\f denotes the Frobenius norm. The Frobenius norm of an m x n 
matrix B = {%} is given by 



\m% = £I> 
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Note that if the vector space ]R mxn is equipped with the inner product 
{A, B) = tr(A'B), then \\B\\% = {B,B}. 

1.1 Hard and Soft Thresholding 

A natural starting point for reconstruction of the target matrix A in (1) is 
the singular value decomposition (SVD) of the observed matrix Y. Recall 
that the singular value decomposition of an m x n matrix Y is given by the 
factorization 

mAn 

Y = UDV = J2 d i u i v 'j- 
3=1 

Here U is an m x m orthogonal matrix whose columns are the left singular 
vectors Uj, V is an n x n orthogonal matrix whose columns are the right 
singular vectors Vj, and D is an m x n matrix with singular values dj = 
Djj > on the diagonal and all other entries equal to zero. Although it 
is not necessarily square, we will refer to D as a diagonal matrix and write 
D = diag(di, . . . , d mAn ), where mAn denotes the minimum of m and n. 

Many matrix reconstruction schemes act by shrinking the singular values 
of the observed matrix towards zero. Shrinkage is typically accomplished by 
hard or soft thresholding. Hard thresholding schemes set every singular value 
of Y less than a given positive threshold A equal to zero, leaving other singular 
values unchanged. The family of hard thresholding schemes is defined by 

mAn 

9x(Y) = ]T^(^>A)^;, A>0. 

3=1 

Soft thresholding schemes subtract a given positive number v from each 
singular value, setting values less than v equal to zero. The family of soft 
thresholding schemes is defined by 

mAn 
3=1 

Hard and soft thresholding schemes can be defined equivalently in the penal- 
ized forms 

gf(Y) = argmin{||F- J B||| + A 2 rank( J B)} 

B 



g s v {Y) = argnjin{||y-S|||, + 2i/||B||J. 

B 

In the second display, ||S||* denotes the nuclear norm of B, which is equal 
to the sum of its singular values. 
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Figure 1: Scree plot for a 1000 x 1000 rank 2 signal matrix with noise. 



In practice, hard and soft thresholding schemes require estimates of the 
noise variance, as well as the selection of appropriate cutoff or shrinkage 
parameters. There are numerous methods in the literature for choosing the 
hard threshold A. Heuristic methods often make use of the scree plot, which 
displays the singular values of Y in decreasing order: A is typically chosen 
to be the y-coordinate of a well defined "elbow" in the resulting curve. A 
typical scree plot for a 1000 x 1000 matrix with rank 2 signal is shown in 
Figure 1. The "elbow" point of the curve on the plot clearly indicate that 
the signal has rank 2. 

A theoretically justified selection of hard threshold A is presented in re- 
cent work of Bunea et al. (2010). They also provide performance guaranties 
for the resulting hard thresholding scheme using techniques from empirical 
process theory and complexity regularization. Selection of the soft threshold- 
ing shrinkage parameter v may also be accomplished by a variety of methods. 
Negahban and Wainwright (2009) propose a specific choice of v and provide 
performance guarantees for the resulting soft thresholding scheme. 

Figure 2 illustrates the action of hard and soft thresholding on a 1000 x 
1000 matrix with a rank 50 signal. The blue line indicates the singular values 
of the signal A and the green line indicates the those of the observed matrix Y . 
The plots show the singular values of the hard and soft thresholding estimates 
incorporating the best choice of the parameters A and u, respectively. It is 
evident from the figure that neither thresholding scheme delivers an accurate 



1000 x 1000 Matrix, Rank 50 Signal 



1000 x 1000 Matrix, Rank 50 Signal 











Singular values of A 

Singular values of Y 

Best hard thresholding 


- 


^- 


- 









Singular values ol A 

Singular values ol Y - 

Best so It thresholding 


^T 


- 


\ 



10 20 30 40 50 60 70 



10 20 30 40 50 60 70 



Figure 2: Singular values of hard and soft thresholding estimates. 



estimate of the signal's singular values. Moreover examination of the loss 
indicates that they do not provide a good estimates of the signal matrix. 

The families of hard and soft thresholding methods encompass many ex- 
isting reconstruction schemes. Both thresholding approaches seek low rank 
(sparse) estimates of the target matrix, and both can be naturally formulated 
as optimization problems. However, the family of all reconstruction schemes 
is much larger, and it is natural to consider alternatives to hard and soft 
thresholding that may offer better performance. 

In this paper, we start with a principled analysis of the matrix recon- 
struction problem, with the effort of making as few assumptions as possible. 
Theoretically motivated design of the method. Based on the analysis of the 
reconstruction problem and, in particular, the analysis of the effect of noise 
on low-rank matrices we design a new reconstruction method with a theoret- 
ically motivated design. 



1.2 Outline 

We start the paper with an analysis of the finite sample properties of the 
matrix reconstruction problem. The analysis does not require the matrix A 
to have low rank and only requires that the distribution of noise matrix W is 
orthogonally invariant (does not change under left and right multiplications 
by orthogonal matrices). Under mild conditions (the prior distribution of A 
must be orthogonally invariant) we prove that we can restrict our attention to 
the reconstruction methods that are based on the SVD of the observed matrix 
Y and act only on its singular values, not affecting the singular vectors. 



This result has several useful consequences. First, it reduces the space of 
reconstruction schemes we consider from g : R mxn — y ]^ mxri to just ]R mAn — y 
l mAn . Moreover, it gives us the recipe for design of the new reconstruction 
scheme: we determine the effect of the noise on the singular values and the 
singular value decomposition of the signal matrix A and then built the new 
reconstruction method to reverse the effect of the noise on the singular values 
of A and account for its effect on the singular vectors. 

To determine the effect of noise on low-rank signal we build a connection 
between the matrix reconstruction problem and spiked population models 
in random matrix theory. Spiked population models were introduced by 
Johnstone (2001). The asymptotic matrix reconstruction model that matches 
the setup of spiked population models assumes the rank of A and its non-zero 
singular values to be fixed as the matrix dimensions to grow at the same rate: 
m, n — y oo and m/n — > c > 0. We use results from random matrix theory 
about the limiting behavior of the eigenvalues (Marcenko and Pastur, 1967; 
Wachter, 1978; Geman, 1980; Baik and Silverstein, 2006) and eigenvectors 
(Paul, 2007; Nadler, 2008; Lee et al., 2010) of sample covariance matrices in 
spiked population models to determine the limiting behavior of the singular 
values and the singular vectors of matrix Y. 

We apply these results to design a new matrix reconstruction method, 
which we call RMT for its use of random matrix theory. The method es- 
timated the singular values of A from the singular values of Y and applies 
additional shrinkage to them to correct for the difference between the singu- 
lar vectors of Y and those of A. The method uses an estimator of the noise 
variance which is based on the sample distribution of the singular values of 
Y corresponding to the zero singular values of A. 

We conduct an extensive simulation study to compare RMT method 
against the oracle versions of hard and soft thresholding and the orthogo- 
nally invariant oracle. We run all four methods on matrices of different sizes, 
with signals of various ranks and spectra. The simulations clearly show that 
RMT method strongly outperforms the oracle versions of both hard and soft 
thresholding methods and closely matches the performance of the orthogo- 
nally invariant oracle (the oracle scheme that acts only on the singular values 
of the observed matrix) . 

The paper is organized as follows. In Section 2 we present the analysis 
of the finite sample properties of the reconstruction problem. In Section 3 
we determine the effect of the noise on the singular value decomposition of 



row rank matrices. In Section 4 we construct the proposed reconstruction 
method based on the results of the previous section. The method employs 
the noise variance estimator presented in Section 4.1. Finally, in Section 5 we 
present the simulation study comparing RMT method to the oracle versions 
of hard and soft thresholding and the orthogonally invariant oracle method. 

2 Orthogonally Invariant Reconstruction 
Methods 

The additive model (1) and Frobenius loss (2) have several elementary in- 
variance properties, that lead naturally to the consideration of reconstruction 
methods with analogous forms of invariance. Recall that a square matrix U 
is said to be orthogonal if UU' = U'U = I, or equivalently, if the rows (or 
columns) of U are orthonormal. If we multiply each side of (1) from the left 
right by orthogonal matrices U and V of appropriate dimensions, we obtain 

UYV = UAV + ^=UWV'. (3) 



Proposition 1. Equation (3) is a reconstruction problem of the form (1) 
with signal UAV and observed matrix UYV . If A is an estimate of A in 
model (1), then UAV is an estimate of UAV in model (3) with the same 
loss. 

Proof. If A has rank r then UAV also has rank r. Thus prove the first 
statement, it suffices to show that UWV in (3) has independent N(0, 1) 
entries. This follows from standard properties of the multivariate normal 
distribution. In order to establish the second statement of the proposition, 
let U and V be the orthogonal matrices in (3). For any m x n matrix B, 

\\UB\\ 2 F = tr[(UB)'(UB)] = tr [B'B] = \\B\\ 2 F , 

and more generally ||C/5V'||^ = ||-B||f- Applying the last equality to B — 
A- A yields 

Loss(UAV, UAV) = \\U(A-A)V\\ 2 p = \\A-A\\ 2 F = Loss(A,A) 

as desired. □ 



In the proof we use the fact that the distribution of matrix W does not 
change under left and right multiplications by orthogonal matrices. We will 
call such distributions orthogonally invariant. 

Definition 2. A random m x n matrix Z has an orthogonally invariant 
distribution if for any orthogonal matrices U and V of appropriate size the 
distribution ofUZV is the same as the distribution of Z . 

In light of Proposition 1 it is natural to consider reconstruction schemes 
those action do not change under orthogonal transformations of the recon- 
struction problem. 

Definition 3. A reconstruction scheme g(-) is orthogonally invariant if for 
any m x n matrix Y , and any orthogonal matrices U and V of appropriate 
size, g(UYV) = Ug{Y)V. 

In general, a good reconstruction method need not be orthogonally in- 
variant. For example, if the signal matrix A is known to be diagonal, then for 
each Y the estimate g(Y) should be diagonal as well, and in this case g(-) is 
not orthogonally invariant. However, as we show in the next theorem, if we 
have no information about the singular vectors of A (either prior information 
or information from the singular values of A), then it suffices to restrict our 
attention to orthogonally invariant reconstruction schemes. 

Theorem 4. Let Y = A + W , where A is a random target matrix. Assume 
that A and W are independent and have orthogonally invariant distributions. 
Then, for every reconstruction scheme g(-), there is an orthogonally invariant 
reconstruction scheme g(-) whose expected loss is the same, or smaller, than 
that ofg(-). 

Proof. Let U be an m x m random matrix that is independent of A and W, 
and is distributed according to Haar measure on the compact group oimxm 
orthogonal matrices. Haar measure is (uniquely) defined by the requirement 
that, for every m x m orthogonal matrix C, both CU and XJC have the 
same distribution as U (c.f. Hofmann and Morris, 2006). Let V be an n x n 
random matrix distributed according to the Haar measure on the compact 
group of n x n orthogonal matrices that is independent of A, W and U. 
Given a reconstruction scheme g(-), define a new scheme 

g(Y) = E[U'<7(UyV')V|Y]. 



It follows from the definition of U and V that g(-) is orthogonally invari- 
ant. The independence of {U, V} and {A, W} ensures that conditioning 
on Y is equivalent to conditioning on {A, W}, which yields the equivalent 
representation 

g(Y) = E[U'0(UYV')V|A,W1. 

Therefore, 

ELoss(A,#(y)) = E||E[U'0(UYV')V-A|A,W]|£ 
< E||U'#(UYV')V - Af F 
= E\\g(UYV') -UAV'Wl, 

the inequality follows from the conditional version of Jensen's inequality ap- 
plied to each term in the sum defining the squared norm. The final equality 
follows from the orthogonality of U and V. The last term in the previous 
display can be analyzed as follows: 



E||#(UyV)-UAV'||^ 

= E 



E 



E(||#(UAV + n- 1/2 UWV) - UAV'IH | U, V, A) 
E(||#(UAV + n- 1/2 W) - UAV'II^ | U, V, A) 



ElUUAV + n- 1,2 W) - UAV'I 



F" 



The first equality follows from the definition of Y; the second follows from 
the independence of W and U, A, V, and the orthogonal invariance of C(W). 
By a similar argument using the orthogonal invariance of £(A), we have 

E||#(UAV + n~ 1/2 W) 
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UAV'Hl 
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The final term above is ELoss(A,5f(y)). This completes the proof. □ 

Based on Theorem 4 will restrict our attention to orthogonally invariant 
reconstruction schemes in what follows. 



As noted in introduction, the singular value decomposition (SVD) of the 
observed matrix Y is a natural starting point for reconstruction of a signal 
matrix A. As we show below, the SVD of Y is intimately connected with 
orthogonally invariant reconstruction methods. An immediate consequence 
of the decomposition Y = UDV is that U'YV = D, so we can diagonalize 
Y by means of left and right orthogonal multiplications. 

The next proposition follows from our ability to diagonalize the signal 
matrix A in the reconstruction problem. 

Proposition 5. Let Y = A + n~ l l 2 W , where W has an orthogonally invari- 
ant distribution. If g(-) is an orthogonally invariant reconstruction scheme, 
then for any fixed signal matrix A, the distribution of Loss(A,g(Y)), and in 
particular KLoss(A, g(Y)) , depends only on the singular values of A. 

Proof. Let UD A V be the SVD of A. Then D A = U'A V, and as the Frobe- 
nius norm is invariant under left and right orthogonal multiplications, 

Loss(A, g(Y)) = \\g(Y)-A\\% = \\ U' (g(Y) - A) V \\ 2 F 



U'g(Y)V-U'AV\\ F = \\g(U'YV)-D A \\ F 

g(D A + n- 1/2 U'WV)-D Anr . 



|2 



The result now follows from the fact that UWV has the same distribution 
as W. □ 

We now address the implications of our ability to diagonalize the observed 
matrix Y . Let g(-) be an orthogonally invariant reconstruction method, and 
let UDV be the singular value decomposition of Y. It follows from the 
orthogonal invariance of g(-) that 

m n 

g{Y) = g{UDV) = Ug(D)V = J]I>M ( 4 ) 

i=i j=i 

where c^ depend only on the singular values of Y. In particular, any orthog- 
onally invariant g(-) reconstruction method is completely determined by how 
it acts on diagonal matrices. The following theorem allows us to substantially 
refine the representation (4). 

Theorem 6. Let g(-) be an orthogonally invariant reconstruction scheme. 
Then g(Y) is diagonal whenever Y is diagonal. 

10 



Proof. Assume without loss of generality that m > n. Let the observed ma- 
trix Y = diag(di,d2,---,d n ), and let A = g(Y) be the reconstructed matrix. 
Fix a row index 1 < k < m. We will show that Akj = for all j ^ k. Let Dl 
be an m x m matrix derived from the identity matrix by flipping the sign of 
the k th diagonal element. More formally, Dl = I — le^ki where e^ is the 
k th standard basis vector in M. m . The matrix Dl is known as a Householder 
reflection. 

Let D R be the top left n x n submatrix of Dl- Clearly DlD' l = I and 
DrD'r — I, so both Dl and Dr are orthogonal. Moreover, all three matrices 
Dl, Y, and D R are diagonal, and therefore we have the identity Y = D L YD R . 
It then follows from the orthogonal invariance of g(-) that 

A = g{Y) = g(D L YD R ) = D L g(Y)D R = D L AD R . 

The (i,j) th element of the matrix DlAD r is Aij(—l) Stk (—l) s ^ k , and therefore 
A^ = —A^ if j 7^ k. As k was arbitrary, A is diagonal. □ 

As an immediate corollary of Theorem 6 and equation (4) we obtain a 
compact, and useful, representation of any orthogonally invariant reconstruc- 
tion scheme g(-). 

Corollary 7. Let g(-) be an orthogonally invariant reconstruction method. If 
the observed matrix Y has singular value decomposition Y = Y2 ^j u i v 'j then 
the reconstructed matrix has the form 

mAn 

A = g(Y) = ]jr W ;, (5) 

where the coefficients Cj depend only on the singular values ofY. 

The converse of Corollary 7 is true under a mild additional condition. 
Let g(-) be a reconstruction scheme such that g(Y) = J2 c j u j v 'ji where 
Cj = Cj(d\, . . . , d mAn ) are fixed functions of the singular values of Y. If 
the functions {cj(-)} are such that Ci(d) = Cj(d) whenever di = dj, then g(-) 
is orthogonally invariant. This follows from the uniqueness of the singular 
value decomposition. 



11 



3 Asymptotic Matrix Reconstruction and 
Random Matrix Theory 

Random matrix theory is broadly concerned with the spectral properties of 
random matrices, and is an obvious starting point for an analysis of ma- 
trix reconstruction. The matrix reconstruction problem has several points 
of intersection with random matrix theory. Recently a number of authors 
have studied low rank deformations of Wigner matrices (Capitaine et al., 
2009; Feral and Peche, 2007; Maida, 2007; Peche, 2006). However, their re- 
sults concern symmetric matrices, a constraint not present in the reconstruc- 
tion model, and are not directly applicable to the reconstruction problem 
of interest here. (Indeed, our simulations of non-symmetric matrices exhibit 
behavior deviating from that predicted by the results of these papers.) A 
signal plus noise framework similar to matrix reconstruction is studied in 
Dozier and Silverstein (2007) and Nadakuditi and Silverstein (2007), how- 
ever both these papers model the signal matrix to be random, while in the 
matrix reconstruction problem we assume it to be non-random. El Karoui 
(2008) considered the problem of estimation the eigenvalues of a population 
covariance matrix from a sample covariance matrix, which is similar to the 
problem of estimation of the singular values of A from the singular values of 
Y. However for the matrix reconstruction problem it is equally important to 
estimate the difference between the singular vectors of A and Y, in addition 
to the estimate of the singular values of A. 

Our proposed denoising scheme is based on the theory of spiked pop- 
ulation models in random matrix theory. Using recent results on spiked 
population models, we establish asymptotic connections between the sin- 
gular values and vectors of the signal matrix A and those of the observed 
matrix Y. These asymptotic connections provide us with finite-sample esti- 
mates that can be applied in a non-asymptotic setting to matrices of small 
or moderate dimensions. 

3.1 Asymptotic Matrix Reconstruction Model 

The proposed reconstruction method is derived from an asymptotic version 
of the matrix reconstruction problem (1). For n > 1 let integers m = m(n) 
be defined in such a way that 

in 

— — > c > as n — > oo. (6) 

n 

12 



For each n let Y, A, and W be m x n matrices such that 



> x '■i 



Y = A + -^W, (7) 

<n 



where the entries of W are independent N(0, 1) random variables. We assume 
that the signal matrix A has fixed rank r > and fixed non-zero singular 
values Ai(A), . . . , X r (A) that are independent of n. The constant c represents 
the limiting aspect ratio of the observed matrices Y. The scale factor n~ 1 ^ 2 
ensures that the singular values of the signal matrix are comparable to those 
of the noise. We note that Model (7) matches the asymptotic model used by 
Capitaine et al. (2009); Feral and Peche (2007) in their study of fixed rank 
perturbations of Wigner matrices. 

In what follows Xj{B) will denote the j'-th singular value of a matrix B, 
and Uj(B) and Vj(B) will denote, respectively, the left and right singular 
values corresponding to Xj(B). Our first proposition concerns the behavior 
of the singular values of Y when the signal matrix A is equal to zero. 

Proposition 8. Under the asymptotic reconstruction model with A = the 
empirical distribution of the singular values Ai(F) > • • • > A mAn (F) con- 
verges weakly to a (non-random) limiting distribution with density 

f Y (s) = -^—J(a-s*)(s*-b), s G [V5, y/b], (8) 

71{C A 1) 

where a = (1 — yfc) 2 and b = (1 + y/c) 2 . Moreover, \i(Y) — > 1 + yfc and 
X m An(Y) — > 1 — \/c as n tends to infinity. 

The existence and form of the density /y(-) are a consequence of the 
classical Marcenko-Pastur theorem (Marcenko and Pastur, 1967; Wachter, 
1978). The in-probability limits of Xi{Y) and A mAn (F) follow from later work 
of Geman (1980) and Wachter (1978), respectively. If c = 1, the density 
function fy(s) simplifies to the quarter-circle law /y(s) = 7t _1 a/4 — s 2 for 

se[o,2]. 

The next two results concern the limiting eigenvalues and eigenvectors 
of Y when A is non-zero. Proposition 9 relates the limiting eigenvalues of 
Y to the (fixed) eigenvalues of A, while Proposition 10 relates the limiting 
singular vectors of Y to the singular vectors of A. Proposition 9 is based on 
recent work of Baik and Silverstein (2006), while Proposition 10 is based on 
recent work of Paul (2007), Nadler (2008), and Lee et al. (2010). The proofs 
of both results are given in Section 5.5. 

13 



Proposition 9. Let Y follow the asymptotic matrix reconstruction model (7) 
with signal singular values Xi(A) > ... > X r (A) > 0. For 1 < j < r, as n 

tends to infinity, 



A ,(Y) J^ I ( 1 + x2 M) + c +i^) 1/2 if *M)>rc 
[l + Vc if < Aj(A) < 



ITie remaining singular values X r +i(Y), . . . , A mAn (F) ofY are associated with 
the zero singular values of A: their empirical distribution converges weakly 
to the limiting distribution in Proposition 8. 

Proposition 10. Let Y follow the asymptotic matrix reconstruction model 
(7) with distinct signal singular values Xi(A) > X 2 (A) > ... > X r (A) > 0. 
Fix j such that Xj(A) > tfc. Then as n tends to infinity, 



[u 3 (Y)MA)) -. (l- /r . ^ A) 



and 

{ V] {Y),v {A)Y 



XKA)J> V \*JA) 



LJ 

Moreover, if k = l,...,r not equal to j then (uj(Y),Uk(A)) — > and 
(vj(Y),Vk{A)) — > as n tends to infinity. 

The limits established in Proposition 9 indicate a phase transition. If 
the singular value Xj(A) is less than or equal to ^fc then, asymptotically, 
the singular value Xj(Y) lies within the support of the Marcenko-Pastur 
distribution and is not distinguishable from the noise singular values. On 
the other hand, if Xj(A) exceeds tfc then, asymptotically, Xj(Y) lies outside 
the support of the Marcenko-Pastur distribution, and the corresponding left 
and right singular vectors of Y are associated with those of A (Proposition 
10). 

4 Proposed Reconstruction Method 

Assume for the moment that the variance a 2 of the noise is known, and equal 
to one. Let Y be an observed mxn matrix generated from the additive model 
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Y = A + n-V 2 W, and let 



mf\n 



y = 5>(y )«,(>>; (y) 

be the SVD of y. Following the discussion in Section 2, we seek an estimate 
A of the signal matrix A having the form 

m/\n 

a = ^ Cj ^(y)^.(y), 

where each coefficient Cj depends only on the singular values Xi(Y),..., 

^mAn(y) of y. We derive A from the limiting relations in Propositions 9 

and 10. By way of approximation, we treat these relations as exact in the 

i l i 

non-asymptotic setting under study, using the symbols = , < and > to 

denote limiting equality and inequality relations. 

Suppose initially that the singular values and vectors of the signal matrix 

A are known. In this case we wish to find coefficients {cj} minimizing 

mAn r 

Loss(A,A) = || £ cjUjiYWjiY) - ^ X^u^v'^A)^. 

3=1 3=1 

Proposition 9 shows that asymptotically the information about the singular 
values of A that are smaller that ^fc is not recoverable from the singular 
values of Y . Thus we can restrict the first sum to the first tq = #{j : 
Xj(A) > tfc} terms 

rp r 

loss(a,a) = nx>i%(yKon - E A ^)%W(^)I&- 

3=1 3=1 

Proposition 10 ensures that the left singular vectors Uj(Y) and Uk(A) are 
asymptotically orthogonal for k = 1, . . . ,r not equal to j = 1, . . . ,r , and 
therefore 

ro r 

Loss(A,l) I YhMy)^) - X 3 {A)u 3 {AmA)\\ 2 F + ^ X^A). 

3=1 3=ro+l 
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Fix 1 < j < r Q . Expanding the j-th term in the above sum gives 
WX^u^A^A) - C] u 3 (Y)v' {Y)\\ 2 F 

= 3hsV)W)tF + ^M)\H A >'M)\\l 

- 2c 3 X 3 (A)(u 3 (A)v' 3 (A),u 3 (Y)v' 3 (Y)) 
= \){A) + c) - 2c 3 X 3 (A)(u 3 (A),u 3 (Y))(v 3 (A),v 3 (Y)). 
Differentiating the last expression with respect to c 3 yields the optimal value 
c* = \ 3 {A) ( Uj (A),u 3 {Y)) (v 3 (A),v 3 (Y)). (9) 

In order to estimate the coefficient c* we consider separately singular 

values of Y that are at most, or greater than 1 + \/c, where c = m/n is the 

l 
aspect ratio of Y. By Proposition 9, the asymptotic relation Xj(Y) < 1 + a/c 

implies Xj{A) < tfc, and in this case the j-th singular value of A is not 

recoverable from Y. Thus if Xj(Y) < 1 + \/c we set the corresponding 

coefficient c* = 0. 

On the other hand, the asymptotic relation Xj{Y) > 1 + a/c implies that 
Xj(A) > -\/c, and that each of the inner products in (9) are asymptotically 
positive. The displayed equations in Propositions 9 and 10 can then be 
used to obtain estimates of each term in (9) based only on the (observed) 
singular values of Y and its aspect ratio c. These equations yield the following 
relations: 



A?(A) = \ 



\ 2 3 {Y)-{l + c) + J[\){Y)-{l + c)Y-Ac 



estimates XAA) 



0- = ( 1 - ^— | / 11 + ^—] estimates (uAA^.uAY))' 

1 ma) ) ' I >?M) ' ' ' 



1 - ^— I / ( 1 + ~ I estimates (vAA^.vAY))'' 

\%A) ' \ XV A) ' ..>■>■. 



With these estimates in hand, the proposed reconstruction scheme is defined 
via the equation 

G RMT {Y) = J2 \ 3 {A)e 3 ^ U] {Y)v' 3 {Y\ (10) 

Aj(Y)>l+V5 
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where Xj{A), 9j, and (f)j are the positive square roots of the estimates defined 
above. 

The RMT method shares features with both hard and soft thresholding. 
It sets to zero singular values of Y smaller than the threshold (1 + \/c), 
and it shrinks the remaining singular values towards zero. However, unlike 
soft thresholding the amount of shrinkage depends on the singular values, 
the larger singular values are shrunk less than the smaller ones. This latter 
feature is similar to that of LASSO type estimators based on an L q penalty 
with < q < 1 (also known as bridge estimators Fu, 1998). It is important 
to note that, unlike hard and soft thresholding schemes, the proposed RMT 
method has no tuning parameters. The only unknown, the noise variance, is 
estimated within the procedure. 

In the general version of the matrix reconstruction problem, the variance 
a 2 of the noise is not known. In this case, given an estimate a 2 of a 2 , such 
as that described below, we may define 

G RMT (Y) = a G* MT Cp) , (11) 

where G RMT (-) is the estimate defined in (10). 

4.1 Estimation of the Noise Variance 

Let Y be derived from the asymptotic reconstruction model Y = A + 
orT x l 2 W with sigma unknown. While it is natural to try to estimate a 
from the entries of Y, the following general results indicate that, under mild 
conditions, it is sufficient to consider estimates based on the singular values 
of Y. The results and their proofs parallel those in Section ??. 

Definition 11. A function s(-) : ]R mxn — > R is orthogonally invariant if for 
any m x n matrix Y and any orthogonal matrices U and V of appropriate 

sizes, s{Y) = s{UYV). 

Proposition 12. A function s(-) : ]R mxn — > R is orthogonally invariant if 
and only if s(Y) depends only on the singular values ofY. 

Proposition 13. Let s(-) : ]R mxn — > R. Then there is an orthogonally invari- 
ant function §(■) with the following property. Let A and W be independent 
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m x n random matrices with orthogonally invariant distributions, and let 

Y = A + on~ x l 2 W for some a . Then s{Y) has the same expected value as 
s(Y) and a smaller or equal variance. 

Based Propositions 12 and 13 we restrict our attention to the estimates of 
a that depend only on the singular values of Y. It follows from Proposition 9 
that the empirical distribution of the (m — r) singular values S = {Xj(Y/a) : 
Xj(A) = 0} converges weakly to a distribution with density (8) supported on 
the interval [|1 — y/c\, 1 + y/c\. Following the general approach outlined in 
Gyorfi et al. (1996), we estimate a by minimizing the Kolmogorov-Smirnov 
distance between the observed sample distribution of the singular values of 

Y and that predicted by theory. Let F be the CDF of the density (8). For 
each a > let S a be the set of singular values \j{Y) that fall in the interval 
[er|l — a/c|, <t(1 + y/c)], and let F a be the empirical CDF of S a . Then 

K(a) = snp\F(s/a)-F a (s)\ 

s 

is the Kolmogorov-Smirnov distance between the empirical and theoretical 
singular value distribution functions, and we define 

a(Y) = argmin K(a) (12) 

cr>0 

to be the value of a minimizing K(cr). A routine argument shows that the 
estimator a is scale invariant, in the sense that a(j3Y) = f3a(Y) for each 
/3>0. 

By considering the jump points of the empirical CDF F a (s), the supre- 
mum in K(a) simplifies to 



K(a) = max 



r,/ / N i~ 1/2 

F( Sl /a) - 






2\SJ 



where {sj} are the ordered elements of S a . The objective function K(a) is 
discontinuous at points where the S a changes, so we minimize it over a fine 
grid of points a in the range where \S a \ > (mAn)/2 and a(l + y / c) < 2Xi(Y). 
The closed form of the cumulative distribution function F(-) is presented in 
Section 5.4. 



5 Simulations 

We carried out a simulation study to evaluate the performance of the RMT 
reconstruction scheme G {•) defined in (11) using the variance estimate 
a in (12). The study compared the performance of G RMT {) to three al- 
ternatives: the best hard thresholding reconstruction scheme, the best soft 
thresholding reconstruction scheme, and the best orthogonally invariant re- 
construction scheme. Each of the three competing alternatives is an oracle- 
type procedure that is based on information about the signal matrix A that 
is not available to G RMT (-). 

5.1 Hard and Soft Thresholding Oracle Procedures 

Hard and soft thresholding schemes require specification of a threshold pa- 
rameter that can depend on the observed matrix Y . Estimation of the noise 
variance can be incorporated into the choice of the threshold parameter. In 
order to compare the performance of G RMT (-) against every possible hard 
and soft thresholding scheme, we define oracle procedures 

G H (Y) = g*(Y) where A* = argmin \\A - gj?(Y)\\l (13) 

A>0 

G S (Y) = g s v ,(Y) where u* = argmin \\A - g%(Y) II I (14) 

that make use of the signal A. By definition, the loss \\A — G H (Y)\\ 2 F of 
G H (Y) is less than that of any hard thresholding scheme, and similarly the 
loss of G S (Y) is less than that of any soft thresholding procedure. In effect, 
the oracle procedures have access to both the unknown signal matrix A and 
the unknown variance a. They are constrained only by the form of their 
respective thresholding families. The oracle procedures are not realizable in 
practice. 

5.2 Orthogonally Invariant Oracle Procedure 

As shown in Corrolary 7, every orthogonally invariant reconstruction scheme 
g(-) has the form 

mAn 
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where the coefficients Cj are functions of the singular values of Y . 

The orthogonally invariant oracle scheme has coefficients c° minimizing 
the loss 

m/\n 

\\A-J2^n 3 (Y)v 3 (Y)f F 

3=1 

over all choices Cj. As in the case with the hard and soft thresholding oracle 
schemes, the coefficients rf depend on the signal matrix A, which in practice 
is unknown. 

The (rank one) matrices {uj(Y)vj(Y)'} form an orthonormal basis of 
an m A n-dimensional subspace of the mn-dimensional space of all m x n 
matrices. Thus the optimal coefficient c° is simply the matrix inner product 
(A, Uj(Y)vj(Y)'), and the orthogonally invariant oracle scheme has the form 
of a projection 

mAn 

G\Y) = £ (A,u j (Y>AY)')u j (Y)v j (Y)'. (15) 

i=i 

By definition, for any orthogonally invariant reconstruction scheme g(-) and 
observed matrix Y, we have \\A - G*(Y)\\% < \\A - g{Y)\\%. 

5.3 Simulations 

We compared the reconstruction schemes G H (Y), G (Y), and G RMT (Y) to 
G*(Y) on a wide variety of signal matrices generated according to the model 
(1). As shown in Proposition 5, the distribution of the loss \\A — G(Y)\\ 2 F 
depends only on the singular values of A, so we considered only diagonal 
signal matrices. As the variance estimate used in G RMT (-) is scale invariant, 
all simulations were run with noise of unit variance. (Estimation of noise 
variance is not necessary for the oracle reconstruction schemes.) 

5.3.1 Square Matrices 

Our initial simulations considered 1000 x 1000 square matrices. Signal matri- 
ces A were generated using three parameters: the rank r; the largest singular 
value Ai(v4); and the decay profile of the remaining singular values. We con- 
sidered ranks r G {1, 3, 10, 32, 100} corresponding to successive powers of a/10 
up to (mAn)/10, and maximum singular values Xi(A) G {0.9, 1, 1.1, ..., 10}^c 
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falling below and above the critical threshold of \fc = 1. We considered sev- 
eral coefficient decay profiles: (i) all coefficients equal; (ii) linear decay to 
zero; (iii) linear decay to \i(A)/2; and (iv) exponential decay as powers of 
0.5, 0.7, 0.9, 0.95, or 0.99. Independent noise matrices W were generated for 
each signal matrix A. All reconstruction schemes were then applied to the 
resulting matrix Y = A + n~ l / 2 W . The total number of generated signal 
matrices was 3,680. 
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Figure 3: Relative performance of soft and hard thresholding method against 
the orthogonally invariant oracle for 1000 x 1000 matrices. 



Figures 3 and 4 illustrate, respectively, the loss of the best soft thresh- 
olding, best hard thresholding and RMT reconstruction methods (y axis) 
relative to the best orthogonally invariant scheme (x axis). In each case 
the diagonal represents the performance of the orthogonally invariant oracle: 
points farther from the diagonal represent worse performance. The plots show 
clearly that G RMT (-) outperforms the oracle schemes G H () and G s (-), and 
has performance comparable to that of the orthogonally invariant oracle. In 
particular, G R (■) outperforms any hard or soft thresholding scheme, even 
if the latter schemes have access to the unknown variance a and the signal 
matrix A. 

In order to summarize the results of our simulations, for each scheme G(-) 
and for each matrix Y generated from a signal matrix A we calculated the 
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Relative performance ol orthogonally invariant oracle and RMT method 




Figure 4: Relative performance of RMT method and orthogonally invariant 
oracle method for 1000 x 1000 matrices. 



relative excess loss of G() with respect to G*(): 



REL(A,G(Y)) 



Loss(A, G(Y)) 

Loss(A,G*(Y)) 



(16) 



The definition of G*() ensures that relative excess loss is non- negative. The 
average REL of G s (-), G H (), and G RMT (-) across the 3680 simulated 1000 x 
1000 matrices was 68.3%, 18.3%, and 0.61% respectively. Table 1 summa- 
rizes these results, and the results of analogous simulations carried out on 
square matrices of different dimensions. The table clearly shows the strong 
performance of RMT method for matrices with at least 50 rows and columns. 
Even for m = n = 50, the average relative excess loss of the RMT method 
is almost twice smaller then those of the oracle soft and hard thresholding 
methods. 



5.3.2 Rectangular Matrices 

We performed simulations for rectangular matrices of different dimensions 
m, n and different aspect ratios c = m/n. For each choice of dimensions 
m, n we simulated target matrices using the same rules as in the square case: 
rank r G {1, 3, 10, 32, . . .} not exceeding (mAn)/10, maximum singular values 
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Table 1: Average relative excess losses of oracle soft thresholding, oracle 
hard thresholding and the proposed RMT reconstruction method for square 
matrices of different dimensions. 



Matrix size (square) 
Scheme 



Qrmt( 



2000 
0.740 



0.182 
O003~ 



1000 
0.683 



0.183 
OXKKT 



500 
0.694 



100 
0.611 



0.178 
0.008 



0.179 
0.029 



50 
0.640 



0.176 
0X)7T 



Ai(.A) G {0.9, 1, 1.1, ..., 10} ^c, and coefficients decay profiles like those above. 
A summary of the results is given in Table 2, which shows the average REL 
for matrices with 2000 rows and 10 to 2000 columns. Although random 
matrix theory used to construct the RMT scheme requires m and n to tend 
to infinity and at the same rate, the numbers in Table 2 clearly show that the 
performance of the RMT scheme is excellent even for small n, where average 
REL ranges between 0.3% and 0.54%. The average REL of soft and hard 
thresholding are above 18% in each of the simulations. 

Table 2: Average relative excess loss of oracle soft thresholding, oracle hard 
thresholding, and RMT reconstruction schemes for matrices with different 
dimensions and aspect ratios. 



Matrix 
size 


m 


2000 


2000 


2000 


2000 


2000 


2000 


n 


2000 


1000 


500 


100 


50 


10 


Method 


Gs{.) 


0.740 


0.686 


0.653 


0.442 


0.391 


0.243 


GH-) 


0.182 


0.188 


0.198 


0.263 


0.292 


0.379 


G RMT (-) 


0.003 


0.004 


0.004 


0.004 


0.004 


0.005 
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Appendix 

5.4 Cumulative Distribution Function for Variance 
Estimation 

The cumulative density function F(-) is calculated as the integral of f n -i/2 W (s). 
For c = 1 (a = 0, b = 4) it is a common integral 



f(s)ds = — \Jb — s 2 ds = — ( x\/A — x 2 + 4 arcsin 
> ^ Jo 2vr v 



For c 7^ 1 the calculations are more complicated. First we perform the change 
of variables t = s 2 , which yields 



/X fX 

f(s)ds = C s- 2 ^{b-s 2 ){s 2 -a)d 
fa J \/a 

o 

= C t^y/ib-t^t-a^dt, 



s 2 



(I 



where C = 1/(2tt(c A 1)). 

Next we perform a change of variables y = t — [a + b]/2 to make the 
expression in the square root look like h 2 — x 2 , giving 

m ,_ r f x2 - [a+b]/2 y/([b-a]/2-y)(y+[b-a]/2) J 
{X) ~ J- [b - a]/2 y+[a + b]/2 dy 

= c r i+c) ^idy, 

J-2^c y + l + c 

The second equality above uses the fact that a + b = 2(1 + c) and b — a = 4a/c. 
The simple change of variables y = 2^fcz is performed next to make the 
numerator \f\ — z 2 : 

x 2 -(l+c)]/2v^ 



n, i>[x*-(l+c)]/2y/c ^l 



Fix) = —. / ; — ; r=dz 

V ; u{cM) J_ x z + {l + c)/2^c 



Next, the formula 



vf 



z 



2 



-dw = vl — z 2 + 



oarcsm(z) 
z + q H V ' 

qz + 1 



\/q 2 — 1 arctan 



vV-i)(i-* 2 ) 
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is applied to find the closed form of F(x) by substituting z = [x 2 — (1 + 
c)]/2y / c and q — (1 -+- c)/2y/c. The final expression above can be simplified 
as vV-1 = v/[(l + c)/2v^] 2 - 1 = |1 - c\/2y/B. 

5.5 Limit Theorems for Asymptotic Matrix 
Reconstruction Problem 

Propositions 9 and 10 in Section 3 provide an asymptotic connection between 
the eigenvalues and eigenvectors of the signal matrix A and those of the 
observed matrix Y. Each proposition is derived from recent work in random 
matrix theory on spiked population models. Spiked population models were 
introduced by Johnstone (2001). 

5.5.1 The Spiked Population Model 

The spiked population model is formally defined as follows. Let r > 1 and 
constants T\ > ■ ■ ■ > r r > 1 be given, and for n > 1 let integers m = m(n) 
be defined in such a way that 

m 

— — t- c > as n — t- oo. (17) 

n 

For each n let 

T = diag(ri,...,r r ,l,...,l) 

be an m x m diagonal matrix (with m = m(n)), and let X be an m x n 
matrix with independent iV m (0, T) columns. Let T = n~ l XX' be the sample 
covariance matrix of X. 

The matrix X appearing in the spiked population model may be decom- 
posed as a sum of matrices that parallel those in the matrix reconstruction 
problem. In particular, X can be represented as a sum 

X = X x + Z, (18) 

where Xi has independent N m (0,T — I) columns, Z has independent N(0, 1) 
entries, and X\ and Z are independent. It follows from the definition of T 
that 

(T-I) = diag^-l,...,^-!^,...^), 
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and therefore the entries in rows r + 1, . . . , m of Xi are equal to zero. Thus, 
the sample covariance matrix T\ = n~ l X\X[ of X\ has the simple block form 



T 



Tn 







where Tu is an r x r matrix equal to the sample covariance of the first r rows 
of Xi. It is clear from the block structure that the first r eigenvalues of T\ are 
equal to the eigenvalues of T u , and that the remaining (m — r) eigenvalues 
of Ti are equal to zero. The size of Tu is fixed, and therefore as n tends to 
infinity, its entries converge in probability to those of diag(ri — 1, . . . , r r — 1). 
In particular, 

W^X', - (T - I)\\ 2 F A 0. (19) 

Consequently, for each j = 1, . . . , r, as n tends to infinity 

X^n-^X,) = A,(fi) = A^fn) At^-1 (20) 

and 

(%(fii), ei ) 2 A 1, (21) 

where Cj is the j-th canonical basis element in W. An easy argument shows 
that Uj(n~ l l 2 Xi) = Uj(Ti), and it then follows from (21) that 

{utin-^X^etf A 1, (22) 

where Cj is the j-th canonical basis element in R m . 

5.5.2 Proof of Proposition 9 

Proposition 9 is derived from existing results on the limiting singular values 
of T in the spiked population model. These results are summarized in the 
following theorem, which is a combination of Theorems 1.1, 1.2 and 1.3 in 
Baik and Silverstein (2006). 

Theorem A. // T is derived from the spiked population model with param- 
eters Ti, . . . , 7> > 1, then for j — 1, . . . , r, as n — > oo 

j[ ' I (1 + v^) 2 if Kr^l + Vc 
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The remaining sample eigenvalues A r+ i(T), . . . , A mAn (T) are associated with 
the unit eigenvalues ofT, their empirical distribution converges weakly to the 
Marcenko-Pastur distribution. 



We also require the following inequality of Mirsky (1960). 



Theorem B. If B and C are mxn matrices then Y^ii^ji^) ~ / \?'(-^)] 



< 



IC-E" 2 



F- 



Proof of Proposition 9. Fix n > 1 and let Y follow the asymptotic recon- 
struction model (7), where the signal matrix A has fixed rank r and non-zero 
singular values \i(A), . . . , X r (A). Based on orthogonal invariance of the ma- 
trix reconstruction problem, without loss of generality, we will assume that 
the signal matrix A = diag(Ai(A), . . . , X r (A), 0, . . . , 0). 

We begin by considering a spiked population model whose parameters 
match those of the matrix reconstruction model. Let X have the same di- 
mensions as Y and be derived from a spiked population model with covariance 
matrix T having r non-unit eigenvalues 



T; 



, -A 2 (A) + 1, j = l,...,r. (23) 



As noted above, we may represent X as X = Xi + Z, where Xi has inde- 
pendent N(0,T — I) columns, Z has independent iV(O.l) entries and X\ is 
independent of Z. Recall that the limiting relations (19)-(22) hold for this 
representation. 

The matrix reconstruction problem and spiked population model may be 
coupled in a natural way. Let random orthogonal matrices U\ and V\ be 
defined for each sample point in such a way that UxD{V{ is the SVD of X\. 
By construction, the matrices U\, V% depend only on X\, and are therefore 
independent of Z. Consequently U[ZVi has the same distribution as Z. If 
we define W = U[ZVi, then Y = A + n~ x l 2 W has the same distribution as 
the observed matrix Y in the matrix reconstruction problem. 

We apply Mirsky's theorem with B = Y and C = n~ x ' 2 U' x XV\ in order to 



27 



bound the difference between the singular values of Y and those of n X I 2 X: 



mAn 



£ [XAn- 1/2 X) - X^Y)] < Wn-^UiXV, - Y 



rv 1 1 2 

F 



[n 



XI2 U[X X V X - A) + n- 1/2 (U[ZV x - W) 



|2 



Wn-^U'^V, - A\\ 2 p 

mAn 



'£[\ s {n- 1 ' a U l 1 X 1 V 1 )-\ s (A)] : 
0=1 

mAn 

J2Mn- 1/2 X l )-X j (A)} 2 . 



3=1 

The first inequality above follows from Mirsky's theorem and the fact that 
the singular values of rC x l 2 X and n~ 1 ^ 2 U[XVi are the same, even though 
U\ and V\ may not be independent of X. The next two equalities follow by 
expanding X and Y, and the fact that W = U[ZV\. The third equality is a 
consequence of the fact that both U[X\Vi and A are diagonal, and the final 
equality follows from the equality of the singular values of X\ and U[X\V\. 
In conjunction with (20) and (23), the last display implies that 

2 P 



^•(n-V^-A.Cn^O 



Thus the distributional and limit results for the eigenvalues of T = n~ l XX' 
hold also for the eigenvalues of YY', and therefore for YY' as well. The 
relation Xj(Y) = y/Xj(YY') completes the proof. □ 

5.5.3 Proof of Proposition 10 

Proposition 10 may be derived from existing results on the limiting singular 
vectors of the sample covariance T in the spiked population model. These 
results are summarized in Theorem C below. The result was first established 
for Gaussian models and aspect ratios < c < 1 by Paul (2007). Nadler 
(2008) extended Paul's results to c > 0. Recently Lee et al. (2010) further 
extended the theorem to c > and non-Gaussian models. 
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Theorem C. IfTis derived from the spiked population model with distinct 
parameters n > • • • > r r > 1, then for 1 < j < r, 

[ tf l<Tj<l + v/c 



Moreover, for Tj > 1 + \/c and fe 7^ j snc/i £/ia£ 1 < k < r we have 

( Uj (T),u k (T)) 2 ^0. 



Although the last result is not explicitly stated in Paul (2007), it follows 
immediately from the central limit theorem for eigenvectors (Theorem 5, 
Paul, 2007). 

We also require the following result, which is a special case of an inequality 
of Wedin (Wedin, 1972; Stewart, 1991). 

Theorem D. Let B and C be m x n matrices and let 1 < j < m A n. If 

the j-th singular value of C is separated from the singular values of B and 
bounded away from zero, in the sense that for some 5 > 

mm\\j(C) - X k (B)\ > 5 and Aj(C) > 5 



then 



X(5),n,(C)> 2 + {v 3 {B), V] {C)) 2 > 2- 2llB f fF . 



Proof of Proposition 10: Fix n > 1 and let Y follow the asymptotic recon- 
struction model (7), where the signal matrix A has fixed rank r and non-zero 
singular values \i(A), . . . , X r (A). Assume without loss of generality that 
A = diag(Ai(A),...,A r (A), 0,...,0). 

We consider a spiked population model whose parameters match those 
of the matrix reconstruction problem and couple it with the matrix recon- 
struction model exactly as in the proof of Proposition 9. In particular, the 
quantities Tj, T, X, X 1? Z, Ui,Vi, W, and Y are as in the proof of Proposition 
9 and the preceding discussion. 

Fix an index j such that Xj(A) > tfc and thus Tj > 1 + \fc. We apply 
Wedin's theorem with B = Y and C = n~ 1 ' 2 U[XVi. There is 5 > such 
that both conditions of Wedin's theorem are satisfied for the given j with 
probability converging to 1 as n — > 00. The precise choice of S is presented 
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at the end of this proof. It follows from Wedin's theorem and inequality 
(vjiB^VjiC)) 2 < 1 that 

{u j {Y),u j {n- 1 l 2 U[XV 1 )) 2 = {u 3 {B),u 3 {C)) 2 > ! _ gllg^gi 

It is shown in the proof of Proposition 9 that \\B — C\\% = \\n~ 1 ^ 2 U[XVi — 
Y\\ 2 F — > as n ->■ oo. Substituting Uj(n~^ 2 U[XV) = U[uj(X) then yields 

(u,(Y),U[ Uj (X)) 2 A 1. (24) 

Fix k — 1, . . . ,r. As Tj > 1 + \/c Theorem C shows that (uj(T), e k ) 2 has 
a non-random limit in probability, which we will denote by 6 2 k . The relation 
Tj = A 2 (A) + 1 implies that fl 2 fc = [1 - cA" 4 (A)]/[l + cA~ 2 (A)] if j = fc, 
and # 2 fc = otherwise. As Uj(T) = Uj(X), it follows that 

< Mj (A),e fc > 2 ^> 0f fe . 

Recall that the matrix JJ\ consists of the left singular vectors of X±, i.e. 
Uk{Xi) = U\e^. It is shown in (22) that (Uie k ,e k ) 2 = (u k (Xi),e k ) 2 — > 1, so 
we can replace e k by U\ek in the previous display to obtain 

(u 3 (X),U iek ) 2 A 6%. 
It the follows from the basic properties of inner products that 



<U[ Uj (X),e k ) 2 ^ e 2 



jk- 

Using the result (24) of Wedin's theorem we may replace the left term in the 
inner product by Uj(Y), which yields 

{ Uj {Y),e k ) 2 ^ 6%. 

As A = diag(Ai(A), . . . , X r (A), 0, . . . , 0) we have e k = u k (A). By construction 
the matrix Y has the same distribution as Y , so it follows from the last display 
that 

(u,(Y),u k (A)) 2 A 6%, 

which is equivalent to the statement of Proposition 10 for the left singular 
vectors. The statement for the right singular vectors follows from considera- 
tion of the transposed reconstruction problem. 
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Now we find such 5 > that for the fixed j the conditions of Wedin's the- 
orem are satisfied with probability going to 1. It follows from Proposition 9 
that for k — 1 , . . . , r the k-th singular value of Y has a non-random limit in 
probability 

A* = ]imX k {n~ 1/2 X) = limX k (Y). 

Let r be the number of eigenvalues of A such that \k{A) > ^fc (i.e. the 
inequality holds only for k = 1, . . . ,ro). It follows from the formula for A£, 
that \* k > 1 + y/c for k = 1, . . . , r . Note also that in this case X* k is a strictly 
increasing function of Xk{A). All non-zero Xj(A) are distinct by assumption, 
so all X* k are distinct for k = 1, ... , r . Note that A* o+1 = 1 + y/c is smaller 
that A* . Thus the limits of the first r singular values of Y are not only 
distinct, they are bounded away from all other singular values. Define 

6 = - min (A* - A* +1 ) > 0. 

<j fe=l,...,ro 

For any k — 1, . . . , r + 1 the following inequalities are satisfied with proba- 
bility going to 1 as n — )■ oo 

\X k (Y) - X* k \ < 5 and \X k (n- 1/2 X) - X* k \ < 5. (25) 

In applying Wedin's theorem to B = Y and C = n~ 1 / 2 U[XVi we must verify 
that for any j = 1, . . . ,7*0 its two conditions are satisfied with probability 
going to 1. The first condition is Xj(C) > 5. When inequalities (25) hold 

A,(C) = X j {n' l,2 U[XV 1 ) = Un-^X) > X* - 5 
> (A* - A* +1 ) - 5 > 35 - 5 = 25, 

so the first condition is satisfied with probability going to 1. The second 
condition is \Xj(C) — X k (B)\ > 5 for all k 7^ j. It is sufficient to check the 
condition for k = 1, . . . ,r + 1 as asymptotically Xj{C) > X ro+ i(B). From 
the definition of 5 and the triangle inequality we get 

3<5 < \X*-X* k \ < IX^-X^n-^X)] + MrT^X) - X k {Y)\ + |A fe (f)-A*|. 

When inequalities (25) hold the first and the last terms on the right hand 
side sum are no larger than 5, thus 

35 < 5 + \Xj(n- 1/2 X) - X k (Y)\ + 5. 

It follows that the second condition \Xj(C)-X k (B)\ = \Xj(n~ 1/2 X)-X k (Y)\ > 
5 also holds with probability going to 1. □ 

31 



References 

Alter, O., Brown, P., and Botstein, D. 2000. Singular value de- 
composition for genome-wide expression data processing and modeling. 
Proceedings of the National Academy of Sciences 97, 18, 10101. 

Baik, J. and Silverstein, J. 2006. Eigenvalues of large sample covari- 
ance matrices of spiked population models. Journal of Multivariate Anal- 
ysis 97, 6, 1382-1408. 

Bunea, F., She, Y., and Wegkamp, M. 2010. Adaptive Rank Penalized 
Estimators in Multivariate Regression. Arxiv preprint arXiv : 1004- 2995 . 

CANDES, E. AND Recht, B. Exact matrix completion via convex optimiza- 
tion. Foundations of Computational Mathematics, 1-56. 

Candes, E., Romberg, J., and Tao, T. 2006. Robust uncertainty princi- 
ples: Exact signal reconstruction from highly incomplete frequency infor- 
mation. IEEE Transactions on information theory 52, 2, 489-509. 

Capitaine, M., Donati-Martin, C, and Feral, D. 2009. The largest 
eigenvalue of finite rank deformation of large Wigner matrices: convergence 
and non-universality of the fluctuations. The Annals of Probability 37, 1, 
1-47. 

Donoho, D. 2006. Compressed sensing. IEEE Transactions on Information 
Theory 52, 4, 1289-1306. 

Dozier, R. and Silverstein, J. 2007. On the empirical distribution 
of eigenvalues of large dimensional information-plus-noise-type matrices. 
Journal of Multivariate Analysis 98, 4, 678-694. 

El Karoui, N. 2008. Spectrum estimation for large dimensional covariance 
matrices using random matrix theory. The Annals of Statistics 36, 6, 
2757-2790. 

Feral, D. and Peche, S. 2007. The largest eigenvalue of rank one de- 
formation of large Wigner matrices. Communications in Mathematical 
Physics 272, 1, 185-228. 

Fu, W. 1998. Penalized regressions: the bridge versus the lasso. Journal of 
computational and graphical statistics 7, 3, 397-416. 

32 



Geman, S. 1980. A limit theorem for the norm of random matrices. The 
Annals of Probability 8, 2, 252-261. 

Gy6r.fi, L., Vajda, I., and Van Der Meulen, E. 1996. Minimum Kol- 
mogorov distance estimates of parameters and parametrized distributions. 
Metrika 43, 1, 237-255. 

Hofmann, K. AND Morris, S. 2006. The structure of compact groups: a 
primer for the student, a handbook for the expert. Walter De Gruyter Inc. 

HOLTER, N., MlTRA, M., MARITAN, A., ClEPLAK, M., BANAVAR, J., AND 

Fedoroff, N. 2000. Fundamental patterns underlying gene expression 
profiles: simplicity from complexity. Proceedings of the National Academy 
of Sciences 97, 15, 8409. 

Johnstone, I. 2001. On the distribution of the largest eigenvalue in prin- 
cipal components analysis. The Annals of Statistics 29, 2, 295-327. 

Konstantinides, K., Natarajan, B., and Yovanof, G. 1997. Noise 
estimation and filtering using block-based singular value decomposition. 
IEEE Transactions on Image Processing 6, 3, 479-483. 

Lee, S., Zou, F., and Wright, F. A. 2010. Convergence and Prediction 
of Principal Component Scores in High-Dimensional Settings. The Annals 
of Statistics. 

Maida, M. 2007. Large deviations for the largest eigenvalue of rank one 
deformations of Gaussian ensembles. The Electronic Journal of Probabil- 
ity 12, 1131-1150. 

Marcenko, V. and Pastur, L. 1967. Distribution of eigenvalues for some 
sets of random matrices. USSR Sbornik: Mathematics 1, 4, 457-483. 

Mirsky, L. 1960. Symmetric gauge functions and unitarily invariant norms. 
The Quarterly Journal of Mathematics 11, 1, 50. 

Nadakuditi, R. and Silverstein, J. 2007. Fundamental limit of sample 
eigenvalue based detection of signals in colored noise using relatively few 
samples. Signals, Systems and Computers, 686-690. 



33 



Nadler, B. 2008. Finite sample approximation results for principal com- 
ponent analysis: A matrix perturbation approach. The Annals of Statis- 
tics 36, 6, 2791-2817. 

Negahban, S. and Wainwright, M. 2009. Estimation of (near) low- 
rank matrices with noise and high-dimensional scaling. Arxiv preprint 
arXiv.0912.5100. 

Paul, D. 2007. Asymptotics of sample eigenstructure for a large dimensional 
spiked covariance model. Statistica Sinica 17, 4, 1617. 

Peche, S. 2006. The largest eigenvalue of small rank perturbations of Her- 
mitian random matrices. Probability Theory and Related Fields 134, 1) 
127-173. 

Raychaudhuri, S., Stuart, J., and Altman, R. 2000. Principal com- 
ponents analysis to summarize microarray experiments: Application to 
sporulation time series. In in Pacific Symposium on Biocomputing. 452- 
463. 

Stewart, G. 1991. Perturbation theory for the singular value decomposi- 
tion. SVD and Signal Processing, 11: Algorithms, Analysis and Applica- 
tions, 99-109. 

Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, 
T., Tibshirani, R., Botstein, D., and Altman, R. 2001. Missing 
value estimation methods for DNA microarrays. Bioinformatics 17, 6, 
520. 

Wachter, K. 1978. The strong limits of random matrix spectra for sample 
matrices of independent elements. The Annals of Probability 6, 1, 1-18. 

Wall, M., Dyck, P., and Brettin, T. 2001. SVDMAN-singular value 
decomposition analysis of microarray data. Bioinformatics 17, 6, 566. 

Wedin, P. 1972. Perturbation bounds in connection with singular value 
decomposition. BIT Numerical Mathematics 12, 1, 99-111. 

Wongsawat, Y., Rao, K., and Oraintara, S. Multichannel SVD-based 
image de-noising. 



34 



